The popular ggplot2 package is a powerful and flexible R package, implemented by Hadley Wickham, for producing elegant graphics. It uses a systematic framework called the grammar of graphics that allows very fine-grained control over how your final product looks.
You will get lots of practice using ggplot during this course. Here, we just review the overall framework so you can get a feel for how ggplot plots are constructed.
There is an entire online book about ggplot2 called ggplot2: Elegant Graphics for Data Analysis by Hadley Wickham. It explains all the different aspects of ggplot2 mechanics and is a great quick reference.
Garrett Grolemund and Hadley Wickham have also written a book, R for Data Science, that provides a lot of information about managing and displaying data. It uses the Tidyverse approach to data management, and provides a good introduction to using ggplot2 for data exploration (Chapter 3), and later more detail about using it for communication (Chapter 28).
… and don’t forget the handy Data to Viz and R Graph Gallery websites!
The following brief overview is adapted from an STHDA tutorial you can find here that covers all the basic types of plots you can make with ggplot.
The concept behind ggplot2 divides a plot into three fundamental parts:
Plot = Data + Aesthetics + Geometry
The principal elements of every plot can be defined as follows:
The aes() function is used to indicate how to display the data: which categories or measurements to map to x and y coordinates, or to the color, size, or shape of points, etc.
There are two major functions in the ggplot2 package:
qplot() stands for quick plot, which can be used to easily produce simple plots.
The ggplot() function is more flexible and robust than qplot() for building a plot piece by piece.
Plots are constructed by layering geometries, additional aesthetics, and themes on top of the primary aesthetic mapping.
The basic syntax is:
ggplot(data = <data.frame>,
mapping = aes(x = <column of data.frame>, y = <column of data.frame>)) +
geom_<type of geometry>()
Note that the plus sign, which indicates that more layers are to come, must always appear at the end of a line; putting it at the beginning of a line will cause an error.
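To make the template concrete, here is a minimal sketch that builds a plot in stages from the built-in iris data. The object names (p.base, p.layered) are hypothetical, and theme_minimal() is just one of ggplot2's built-in themes, chosen here for illustration.

```r
library(ggplot2)

# Start with the data and the primary aesthetic mapping; no geometry yet
p.base <- ggplot(data = iris,
                 mapping = aes(x = Sepal.Length, y = Sepal.Width))

# Add layers one at a time; each `+` sits at the END of a line
p.layered <- p.base +
  geom_point(aes(color = Species)) +  # geometry plus an additional aesthetic
  theme_minimal()                     # theme layered on top

p.layered
```

Printing p.base on its own draws only the empty axes, which is a handy way to see what each added layer contributes.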
If your data is tidy, then the columns of your data frame will contain the variables that you want to display. Each of these can be mapped to different aesthetics of the graph (e.g. axis, colors, shapes, etc.). A few of the examples below are based on Chapter 3 from R for Data Science.
There are two ways to specify aesthetics:
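Map a variable in the data to an aesthetic inside aes().
Set an aesthetic to a fixed value, independent of the data, outside the aes() directive.
Aesthetic elements include things such as position (x and y), color, fill, shape, size, and transparency (alpha).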
Here are some examples:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Mapping data to coordinates (quantitative) and color (qualitative)
ggplot(data = iris,
       mapping = aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
  geom_point()
# Here, putting the color aesthetic inside the geom layer works the same way
# because there is only one geometry on this graph
# We also add a different shape and size, independent of the data
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point(aes(color=Species), shape=18, size=3)
# assigning point color independent of data
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point(color="blue")
# why doesn't this work?
# (inside aes(), "blue" is treated as a data value with a single level,
# so the points get the default palette color and a legend entry "blue")
ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width)) +
  geom_point(aes(color="blue"))
# Change the size and opacity of points: data are mapped to size, but
# transparency is independent of data
ggplot(data = iris,
       mapping = aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
  geom_point(alpha = 0.4, aes(size = Petal.Length))
R has built-in point shapes that are identified by the numbers 0 to 25:
The color and fill aesthetics are applied differently to these different shapes, depending on whether they are:
hollow (0-14): color maps to border
solid (15-20): color maps to entire shape (border and fill)
filled (21-25): color maps to border, and fill maps to fill
Shapes for other types of plots can also have borders and fill designated separately (e.g. barplots, histograms, boxplots, violin plots).
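The difference between border and fill is easiest to see side by side. The sketch below draws one point from each shape class with the same color and fill settings; the small data frame shape.demo and its column names are made up for illustration, and scale_shape_identity() is used so that the numbers in the data are taken literally as shape codes.

```r
library(ggplot2)

# One representative per class: hollow (1), solid (16), filled (21)
shape.demo <- data.frame(x = 1:3, y = 1, shape.id = c(1, 16, 21))

p.shapes <- ggplot(shape.demo, aes(x = x, y = y, shape = shape.id)) +
  geom_point(color = "black",   # border (or the whole shape for shape 16)
             fill = "orange",   # only visible for the filled shape 21
             size = 8, stroke = 1.5) +
  scale_shape_identity() +      # use the numeric codes as-is
  xlim(0, 4)

p.shapes
```

Only the third point should show the orange interior, since fill is ignored by the hollow and solid shapes.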
Geometries control the type of visual paradigm you want to use to display your data, for example:
Geom functions also allow you to add additional features to a graph, for example:
Statistical features can also be layered onto graphs, e.g.:
stat = "something" inside another geometry (some examples below)
Themes are used to customize the non-data components of your graphs, such as titles, labels, fonts, background, gridlines, and legends. By default, ggplot graphs have a gray background and white gridlines. This can be changed to almost any look and feel by customizing the theme, which can also be used to give plots a consistent look for presentation.
In addition to setting theme() components manually, the ggthemes package also provides a variety of predefined themes that replicate the look and feel of different visual paradigms and applications.
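As a rough illustration of both approaches, the sketch below first applies one of the complete themes that ship with ggplot2 itself (theme_bw(), used here instead of ggthemes so that no extra package is required) and then overrides a few individual components with theme(). The plot title and the object name p.theme are invented for the example.

```r
library(ggplot2)

p.theme <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point() +
  labs(title = "Iris sepal dimensions") +
  theme_bw() +                                  # complete theme: white background
  theme(legend.position = "bottom",             # then adjust single components
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold"))

p.theme
```

The order matters: a complete theme such as theme_bw() replaces all settings, so the manual theme() adjustments should come after it.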
Multiple unrelated graphs can be combined into a single figure using the ggpubr package. This allows you to make publication-ready figures with multiple panels. We show this for a couple of examples below.
A few examples of common plots are illustrated below. Examples for other plot types, including pie charts, line plots, QQ-plots, ECDFs, dendrograms, heat maps, etc. can be found at the Data to Viz website, the R Graph Gallery, or the STHDA tutorial mentioned above.
If you have a frequency table for one or more categorical variables, you can use a barchart to summarize these data.
Note: It is relatively common to see barplots with error bars, showing a quantitative variable on the y-axis, but this is discouraged since it masks the true distribution of the data. It is recommended in such cases to use strip charts, boxplots, or violin plots.
avg.sepal.length = iris %>% group_by(Species) %>%
  summarize(avg = mean(Sepal.Length), .groups = 'drop')
str(avg.sepal.length)
## tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
## $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 2 3
## $ avg : num [1:3] 5.01 5.94 6.59
ggplot(avg.sepal.length, aes(x=Species,y=avg, fill=Species)) +
  geom_bar(stat="identity", color="black") +
  ylab("Mean Sepal Length (cm)")
For two categorical variables, a single value for each combination of categories can be displayed either stacked or side-by-side. Note that bars within a group are usually drawn directly next to each other, while bars for different groups are separated by spaces.
The options for arranging bars within groups are:
Here we will plot average lengths for the different flower attributes in the iris dataset. To do this we first make a table of the mean measurements across all of the columns:
# get the mean for all four measurements for all 3 species
iris.avg = iris %>% group_by(Species) %>%
summarise_all(list(mean))
iris.avg
## # A tibble: 3 x 5
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
This is great, but we have a problem because in order to plot the numerical data grouped by both species and flower attribute, we need to put each of these three “dimensions” into a different column. This means we need to transform our tidy data from a wide format to a long format:
To do this we can use the gather() command from the tidyr package, or the melt() command from the reshape2 package. These end up doing pretty much the same thing, except the gather() command creates an object of class “tibble”, which is just a fancy data frame. You don’t need to worry about this for now.
Note: the opposite of gather() is spread(), and the opposite of melt() is dcast() (cast() belongs to the older reshape package).
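In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The following is a sketch of the same wide-to-long reshaping with the newer function; the object names iris.avg.wide and iris.avg.pl are hypothetical, and the table of means is recomputed here only so the example is self-contained.

```r
library(dplyr)
library(tidyr)

# Recompute the wide table of species means; across() is the newer
# counterpart of summarise_all()
iris.avg.wide <- iris %>%
  group_by(Species) %>%
  summarise(across(everything(), mean))

# Wide -> long: every column except Species becomes a (name, value) pair
iris.avg.pl <- pivot_longer(iris.avg.wide,
                            cols = -Species,
                            names_to = "flower_attr",
                            values_to = "avg")

dim(iris.avg.pl)
```

The result has the same 12 rows and 3 columns as the gather() version, although the rows come out ordered by species rather than by flower attribute.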
iris.avg.long = gather(iris.avg, key="flower_attr", value="avg", -Species)
str(iris.avg.long)
## tibble [12 × 3] (S3: tbl_df/tbl/data.frame)
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3 1 2 3 1 2 3 1 ...
## $ flower_attr: chr [1:12] "Sepal.Length" "Sepal.Length" "Sepal.Length" "Sepal.Width" ...
## $ avg : num [1:12] 5.01 5.94 6.59 3.43 2.77 ...
head(iris.avg.long)
## # A tibble: 6 x 3
## Species flower_attr avg
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.01
## 2 versicolor Sepal.Length 5.94
## 3 virginica Sepal.Length 6.59
## 4 setosa Sepal.Width 3.43
## 5 versicolor Sepal.Width 2.77
## 6 virginica Sepal.Width 2.97
iris.avg.long2 = melt(iris.avg, variable.name="flower_attr", value.name = "avg")
## Using Species as id variables
str(iris.avg.long2)
## 'data.frame': 12 obs. of 3 variables:
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 2 3 1 2 3 1 2 3 1 ...
## $ flower_attr: Factor w/ 4 levels "Sepal.Length",..: 1 1 1 2 2 2 3 3 3 4 ...
## $ avg : num 5.01 5.94 6.59 3.43 2.77 ...
head(iris.avg.long2)
## Species flower_attr avg
## 1 setosa Sepal.Length 5.006
## 2 versicolor Sepal.Length 5.936
## 3 virginica Sepal.Length 6.588
## 4 setosa Sepal.Width 3.428
## 5 versicolor Sepal.Width 2.770
## 6 virginica Sepal.Width 2.974
Now that we have each of our two categorical variables in its own column, and our average measurement in another column, we can plot all of these different combinations.
# make a barplot by group and measurement type
ggplot(iris.avg.long, aes(x=Species, y=avg, fill=flower_attr)) +
  geom_bar(stat="identity") +
  ylab("Mean Length (cm)")
Oops! This view makes it a bit difficult to compare lengths for each flower attribute across species. Let’s make the chart again, but mapping our variables to different aesthetics:
p.stacked = ggplot(iris.avg.long, aes(x=flower_attr, y=avg, fill=Species)) +
  geom_bar(stat="identity") +
  ylab("Mean Length (cm)")
p.beside = ggplot(iris.avg.long, aes(x=flower_attr, y=avg, fill=Species)) +
  geom_bar(stat="identity", position=position_dodge()) +
  ylab("Mean Length (cm)")
ggarrange(p.stacked, p.beside, common.legend = TRUE)
Histograms are good for showing the distribution of a single quantitative variable. Two or more distributions can be shown together on one histogram, though showing more than two or three gets really confusing.
Here we show the distributions of sepal length for the three different iris species with qplot():
qplot(x = Sepal.Length,
      data = iris,
      binwidth = 0.2,
      fill = Species,
      color = Species,
      alpha = I(0.5),  # I() sets a fixed value instead of mapping it as data
      xlab = "Sepal Length (cm)")
… and with ggplot():
p.iris = ggplot(iris, aes(x=Sepal.Length, fill=Species, color=Species)) +
  geom_histogram(alpha=0.3, binwidth=0.2)
p.iris
We can also decorate the plot with additional information by adding a new geometry:
# add the group means computed earlier (avg.sepal.length) as dashed lines
p.iris +
  geom_vline(data=avg.sepal.length,
             aes(xintercept=avg,color=Species),linetype="dashed")
Stacked histograms (multiple histograms shown one below the other) can also be used to show comparisons of multiple distributions, though other methods are more often used since they are more compact.
A fancier way to show multiple distributions is implemented by the ggridges package (see this gallery for examples).
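For the stacked layout described above (one histogram below the other), a minimal sketch using only ggplot2 is to give each species its own panel in a single column. This uses facet_wrap(), which is covered later in this section; the object name p.stack.hist is hypothetical.

```r
library(ggplot2)

p.stack.hist <- ggplot(iris, aes(x = Sepal.Length, fill = Species)) +
  geom_histogram(binwidth = 0.2, color = "black") +
  facet_wrap(~ Species, ncol = 1) +   # one panel per species, stacked vertically
  xlab("Sepal Length (cm)")

p.stack.hist
```

Because all panels share the same x-axis, the shift in sepal length between species can be read directly down the column.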
Scatter plots are used to show the relationship between two numerical variables. Points with different shapes and/or colors can also be used to split the data across different categories.
# plot with a LOESS smooth (locally weighted regression curve)
lm1 = ggplot(iris, aes(x=Sepal.Length, y=Petal.Length, color=Species)) +
  geom_point(alpha=0.5) +
  geom_smooth(color="darkgray") # default method is loess
# same plot but overlaid with linear regression lines for each species
lm2 = lm1 +
  geom_smooth(method=lm, aes(color=Species, fill=Species),
              size=0.5, fullrange=TRUE)
ggarrange(lm1, lm2, common.legend=TRUE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Strip charts (a.k.a. dotplots), boxplots, and violin plots are all good for comparing numerical distributions for multiple categories. Strip charts (dot plots) are good for showing distributions when there are few data points (fewer than about 20-30); when there are more than this, it’s a better choice to use boxplots and/or violin plots.
Strip charts / dotplots can be drawn in two different ways: using geom_dotplot() or geom_jitter(), which produce slightly different visual displays. It is also possible to draw statistical summaries on top of them (see the Datanovia tutorial for examples).
# dot plot
dp = ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
geom_dotplot(binaxis = "y", stackdir="center", dotsize=.5)
# strip chart
sc = ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species)) +
geom_jitter(position=position_jitter(0.2))
ggarrange(dp, sc, labels=c("A","B"))
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
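As one possible way to draw the statistical summaries mentioned above on top of a strip chart, the sketch below overlays the group mean and its standard error using stat_summary(). The choice of mean plus or minus one standard error (via ggplot2's mean_se()) is an assumption made for illustration, not necessarily what the linked tutorial uses, and sc.summary is a hypothetical name.

```r
library(ggplot2)

sc.summary <- ggplot(iris, aes(x = Species, y = Sepal.Length, color = Species)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +   # raw data points
  stat_summary(fun = mean, geom = "point",              # group mean
               shape = 18, size = 4, color = "black") +
  stat_summary(fun.data = mean_se, geom = "errorbar",   # mean +/- 1 SE
               width = 0.2, color = "black")

sc.summary
```

Keeping the raw points visible underneath the summary avoids the problem noted earlier for barplots with error bars.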
Below we make boxplots for the same data with different outlines and fill, and use the ggpubr package to arrange them in a figure:
# outline color mapped to data (default palette)
a = ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species)) +
  geom_boxplot()
# fill mapped to data + one fixed outline color for all boxes
b = ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
  geom_boxplot(color="magenta")
# outline color mapped to data + manually chosen colors
c = ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species)) +
  geom_boxplot() +
  scale_color_manual(values=c("orange", "forestgreen", "purple"))
# fill mapped to data with a Brewer palette + fixed outline color
# (a fill scale is needed here, since color is not mapped to the data)
d = ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
  geom_boxplot(color="pink") +
  scale_fill_brewer(palette="Set1")
# ggpubr: arrange and label plots!
ggarrange(a, b, c, d,
          labels = c("A", "B", "C", "D"),
          ncol = 2, nrow = 2)
… and here we show how to combine boxplots with violin plots, which show both distributions and summary statistics all at once:
box = ggplot(iris, aes(x=Species, y=Sepal.Length, color=Species)) +
  geom_boxplot()
vln = ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
  geom_violin()
# both together
bv = ggplot(iris, aes(x=Species, y=Sepal.Length, fill=Species)) +
  geom_violin(trim=FALSE) +
  geom_boxplot(width=0.1, fill="white")
figure = ggarrange(box, vln, bv,labels = c("A", "B", "C"))
figure
Sometimes we want to show numerical data separated by category, or split according to multiple categories. We can use facets to show a bunch of related data arranged in multiple panels. See this tutorial from Datanovia for examples.
The following is based on the Facetting chapter from the online ggplot2 book.
Facet plots, also called “lattice” or “trellis” plots, are a powerful tool for exploratory data analysis: you can rapidly compare patterns in different parts of the data and see whether they are the same or different. There are three types of faceting:
facet_null() - a single plot (default)
facet_wrap() - “wraps” a 1D ribbon of panels into 2D (useful if you have a large number of categories)
facet_grid() - produces a 2D grid of panels defined by different variables, which form the rows and columns
This figure from the book illustrates the differences between wrapping and making a grid:
facet_wrap() makes a long ribbon of panels (generated by any number of variables) and wraps it into 2D. This is useful if you have a single variable with many levels and want to arrange the plots in a more space-efficient manner.
Both of these can be specified with one or two variables defined by a formula (specified with a tilde symbol, ~). The difference is in the specific syntax.
The output will be similar for one variable, but for two variables the axis labels will differ, and facet_grid() will produce a more sensible representation of the data.
facet_wrap():
(~ a) - spreads the values of a across panels; this facilitates comparisons of y position, because the vertical scales are aligned.
(~ a + b) - spreads the combinations of values for both a and b
For the iris dataset:
base.plot = ggplot(iris, aes(x=Petal.Length, y=Petal.Width, color=Species)) +
  geom_point()
base.plot
facet_grid():
(. ~ a) - spreads the values of a across the columns. This direction facilitates comparisons of y position, because the vertical scales are aligned.
(b ~ .) - spreads the values of b down the rows. This direction facilitates comparison of x position because the horizontal scales are aligned. This makes it particularly useful for comparing distributions.
(b ~ a) - spreads a across columns and b down rows (the formula reads rows ~ columns). You’ll usually want to put the variable with the greatest number of levels in the columns, to take advantage of the aspect ratio of your screen.
We can easily split our plots by Species to produce the same output as facet_wrap():
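A minimal sketch of splitting by Species with each function is shown here. The base plot is rebuilt under a hypothetical name (base.sketch) so the example runs on its own, and the other object names are likewise made up.

```r
library(ggplot2)

base.sketch <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point()

# One panel per species, wrapped into as many rows as needed
p.wrap <- base.sketch + facet_wrap(~ Species)

# Species across the columns: a single row, shared y-axis
p.cols <- base.sketch + facet_grid(. ~ Species)

# Species down the rows: a single column, shared x-axis
p.rows <- base.sketch + facet_grid(Species ~ .)

p.wrap
p.cols
p.rows
```

With only three species, facet_wrap(~ Species) and facet_grid(. ~ Species) both lay the panels out in one row, so the two plots look essentially the same.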
However, since the iris dataset contains only one categorical variable, we would have to do something fancier to get good mileage out of facet_grid() for this example.
We can do this by splitting some of the quantitative data into categories (this is kind of a kluge). Below we plot Sepal Length against Sepal Width (quantitative variables) split out by Petal type (Long vs. Short, Narrow vs. Wide):
data = iris
data$Petal.Width.Range = factor(ifelse(data$Petal.Width < 1.3, "Narrow Petals", "Wide Petals"))
data$Petal.Length.Range = factor(ifelse(data$Petal.Length < 4.35, "Short Petals", "Long Petals"))
ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, color=Species)) +
  geom_point(alpha=0.5) +
  facet_grid(Petal.Width.Range ~ Petal.Length.Range)
You should consult the ggplot2 book chapter for information on customization, such as controlling axis scales, or cutting continuous variables into bins in order to facet them.
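For cutting a continuous variable into bins, ggplot2 provides the helpers cut_interval(), cut_number(), and cut_width(). The sketch below uses cut_number() to split petal length into three bins with roughly equal numbers of flowers; the choice of three bins and the names iris.binned, Petal.Length.Bin, and p.binned are arbitrary assumptions for illustration.

```r
library(ggplot2)

iris.binned <- iris
# cut_number() picks break points so each bin holds about the same number of rows
iris.binned$Petal.Length.Bin <- cut_number(iris.binned$Petal.Length, n = 3)

p.binned <- ggplot(iris.binned, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ Petal.Length.Bin)   # one panel per petal-length bin

p.binned
```

Compared with hand-picked thresholds in ifelse(), this lets the data determine the cut points, at the cost of less memorable panel labels.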
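As with base R graphics, we still open a graphics device (for example with pdf() or png()) and close it with dev.off() to specify the output file for a graph; par() only sets graphical parameters: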
## quartz_off_screen
## 2
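The code that produced this output is not shown, so the following is only a sketch of two common ways to write a ggplot to a file. The file names, dimensions, and the use of tempdir() are assumptions made for the example. The value printed by dev.off() (such as quartz_off_screen 2 on macOS) simply reports which graphics device is current after the file is closed.

```r
library(ggplot2)

p.save <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point()

out.dir <- tempdir()   # hypothetical output location

# Option 1: ggsave() infers the file type from the extension
file1 <- file.path(out.dir, "iris_ggsave.pdf")
ggsave(file1, plot = p.save, width = 6, height = 4)

# Option 2: open a graphics device, print the plot, then close the device
file2 <- file.path(out.dir, "iris_device.pdf")
pdf(file2, width = 6, height = 4)
print(p.save)          # explicit print() is needed inside scripts and functions
dev.off()
```

ggsave() is usually the more convenient choice for a single ggplot, while the device approach also works for base R graphics and multi-page output.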
*** Author: Kris Gunsalus ***